MACS 30200
University of Chicago
\[Y_i = \alpha + \beta x_i + \epsilon_i\]
\[E(\epsilon_i) \equiv E(\epsilon | x_i) = 0\]
\[V(\epsilon | x_i) = \sigma_{\epsilon}^2\]
\[\epsilon \sim N(0, \sigma_\epsilon^2)\]
\(X\) is measured without error and independent of the error
\[\epsilon_i \sim N(0, \sigma_\epsilon^2), \text{for } i = 1, \dots, n\]
Influence
\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]
\[h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i'=1}^{n} (X_{i'} - \bar{X})^2}\]
\[h_i = \mathbf{X}_i (\mathbf{X'X})^{-1} \mathbf{X'}_i\]
Residuals
\[V(E_i) = \sigma_\epsilon^2 (1 - h_i)\]
Standardized residuals
\[E'_i \equiv \frac{E_i}{S_{E} \sqrt{1 - h_i}}\]
\[S_E = \sqrt{\frac{E_i^2}{(n - k - 1)}}\]
Studentized residuals
\[E_i^{\ast} \equiv \frac{E_i}{S_{E(-i)} \sqrt{1 - h_i}}\]
\(\text{DFBETA}_{ij}\)
\[D_{ij} = \hat{\beta_j} - \hat{\beta}_{j(-i)}, \text{for } i=1, \dots, n \text{ and } j = 0, \dots, k\]
\(\text{DFBETAS}_{ij}\)
\[D^{\ast}_{ij} = \frac{D_{ij}}{SE_{-i}(\beta_j)}\]
Cook’s \(D\)
\[D_i = \frac{E^{'2}_i}{k + 1} \times \frac{h_i}{1 - h_i}\]
## term estimate std.error statistic p.value
## 1 (Intercept) -12.1034 2.5432 -4.76 6.57e-06
## 2 age 0.2189 0.0448 4.88 4.01e-06
## 3 tenure -0.0669 0.0643 -1.04 3.00e-01
## 4 unified 0.7176 0.4584 1.57 1.21e-01
Anything exceeding twice the average \(\bar{h}\) is noteworthy
## # A tibble: 9 × 10
## Congress congress nulls age tenure unified year hat student
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1st 1 0 49.8 0.8 1 1789 0.0974 0.330
## 2 3rd 3 0 52.8 4.2 0 1793 0.1132 0.511
## 3 12th 12 0 49.0 6.6 1 1811 0.0802 0.669
## 4 17th 17 0 59.0 16.6 1 1821 0.0887 -0.253
## 5 20th 20 0 61.7 17.4 1 1827 0.0790 -0.577
## 6 23rd 23 0 64.0 18.4 1 1833 0.0819 -0.844
## 7 34th 34 0 64.0 14.6 0 1855 0.0782 -0.561
## 8 36th 36 0 68.7 17.8 0 1859 0.1020 -1.072
## 9 99th 99 3 71.9 16.7 0 1985 0.0912 0.295
## # ... with 1 more variables: cooksd <dbl>
Anything outside of the range \([-2,2]\) is discrepant
## # A tibble: 7 × 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66.0 9.0 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.2229
## 3 90th 90 6 64.7 13.3 1 1967 0.0195 2.49 0.0292
## 4 91st 91 6 65.1 13.0 1 1969 0.0189 2.42 0.0269
## 5 92nd 92 5 62.0 9.2 1 1971 0.0146 2.05 0.0150
## 6 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.1655
## 7 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897
\[D_i > \frac{4}{n - k - 1}\]
## # A tibble: 4 × 10
## Congress congress nulls age tenure unified year hat student cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 67th 67 6 66.0 9.0 1 1921 0.0361 2.14 0.0415
## 2 74th 74 10 71.1 14.2 1 1935 0.0514 4.42 0.2229
## 3 98th 98 7 69.9 14.7 0 1983 0.0730 3.02 0.1655
## 4 104th 104 8 60.6 12.5 1 1995 0.0208 4.48 0.0897
## [1] 0.232
## [1] 0.258
## [1] 1.68
## [1] 1.29
\[\epsilon \sim N(0, \sigma_\epsilon^2)\]
## # A tibble: 3,997 × 4
## age sex compositeHourlyWages yearsEducation
## <int> <chr> <dbl> <int>
## 1 40 Male 10.56 15
## 2 19 Male 11.00 13
## 3 46 Male 17.76 14
## 4 50 Female 14.00 16
## 5 31 Male 8.20 15
## 6 30 Female 16.97 13
## 7 61 Female 6.70 12
## 8 46 Female 14.00 14
## 9 43 Male 19.20 18
## 10 17 Male 7.25 11
## # ... with 3,987 more rows
## term estimate std.error statistic p.value
## 1 (Intercept) -8.124 0.59898 -13.6 5.27e-41
## 2 sexMale 3.474 0.20701 16.8 4.04e-61
## 3 yearsEducation 0.930 0.03426 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
## term estimate std.error statistic p.value
## 1 (Intercept) 1.0990 0.037965 28.9 1.97e-167
## 2 sexMale 0.2245 0.013121 17.1 2.16e-63
## 3 yearsEducation 0.0559 0.002171 25.7 2.95e-135
## 4 age 0.0182 0.000549 33.1 4.50e-212
\[\text{Var}(\epsilon_i) = \sigma^2\]
##
## studentized Breusch-Pagan test
##
## data: sim_homo_mod
## BP = 3, df = 1, p-value = 0.08
##
## studentized Breusch-Pagan test
##
## data: sim_hetero_mod
## BP = 100, df = 1, p-value <2e-16
\[ \begin{bmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \sigma_n^2 \\ \end{bmatrix} \]
\[ \mathbf{W} = \begin{bmatrix} \frac{1}{\sigma_1^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_2^2} & 0 & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_n^2} \\ \end{bmatrix} \]
\[\hat{\mathbf{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}\]
\[\hat{\mathbf{\beta}} = (\mathbf{X}' \mathbf{W} \mathbf{X})^{-1} \mathbf{X}' \mathbf{W} \mathbf{y}\]
\[\sigma_{\epsilon}^2 = \frac{\sum(w_i E_i^2)}{n}\]
## term estimate std.error statistic p.value
## 1 (Intercept) -8.124 0.59898 -13.6 5.27e-41
## 2 sexMale 3.474 0.20701 16.8 4.04e-61
## 3 yearsEducation 0.930 0.03426 27.1 5.47e-149
## 4 age 0.261 0.00866 30.2 3.42e-180
## term estimate std.error statistic p.value
## 1 (Intercept) -8.099 0.01601 -506 0
## 2 sexMale 3.474 0.00977 356 0
## 3 yearsEducation 0.927 0.00142 653 0
## 4 age 0.261 0.00017 1534 0
## term estimate std.error statistic p.value std.error.rob
## 1 (Intercept) -8.124 0.59898 -13.6 5.27e-41 0.63615
## 2 sexMale 3.474 0.20701 16.8 4.04e-61 0.20725
## 3 yearsEducation 0.930 0.03426 27.1 5.47e-149 0.03849
## 4 age 0.261 0.00866 30.2 3.42e-180 0.00881
\[\epsilon \sim N(0, \sigma_\epsilon^2)\]
\[E_i^{(j)} = E_i + B_j X_{ij}\]
\[\log(\text{Wage}) = \beta_0 + \beta_1(\text{Male}) + \beta_2 \text{Age} + \beta_3 \text{Age}^2 + \beta_4 \text{Education}^2\]
## term estimate std.error statistic p.value
## 1 (Intercept) 0.396819 5.78e-02 6.87 7.62e-12
## 2 sexMale 0.221458 1.24e-02 17.79 3.21e-68
## 3 I(yearsEducation^2) 0.001805 7.86e-05 22.96 1.19e-109
## 4 age 0.083018 3.19e-03 26.05 2.93e-138
## 5 I(age^2) -0.000852 4.10e-05 -20.78 3.85e-91
Partial residuals
\[E_i^{\text{Age}} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i + E_i\]
\[E_i^{\text{Education}} = 0.002 \times \text{Education}^2_i + E_i\]
Partial fits
\[\hat{Y}_i^{(\text{Age})} = 0.083 \times \text{Age}_i -0.0008524 \times \text{Age}^2_i\]
\[\hat{Y}_i^{(\text{Education})} = 0.002 \times \text{Education}^2_i\]
Where explanatory variables are correlated with one another
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
##
## Call:
## lm(formula = mpg ~ disp + wt + cyl + disp_mean, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.403 -1.403 -0.495 1.339 6.072
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.10768 2.84243 14.46 1.6e-14 ***
## disp 0.00747 0.01184 0.63 0.5332
## wt -3.63568 1.04014 -3.50 0.0016 **
## cyl -1.78494 0.60711 -2.94 0.0065 **
## disp_mean NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 28 degrees of freedom
## Multiple R-squared: 0.833, Adjusted R-squared: 0.815
## F-statistic: 46.4 on 3 and 28 DF, p-value: 5.4e-11
## term estimate std.error statistic p.value
## 1 (Intercept) -173.411 43.82839 -3.96 9.01e-05
## 2 age -2.291 0.67248 -3.41 7.23e-04
## 3 limit 0.173 0.00503 34.50 1.63e-121
## term estimate std.error statistic p.value
## 1 (Intercept) -377.5368 45.2542 -8.343 1.21e-15
## 2 limit 0.0245 0.0638 0.384 7.01e-01
## 3 rating 2.2017 0.9523 2.312 2.13e-02
Here it is very clear that limit and rating are strongly correlated with one another.
## age limit
## 1.01 1.01
## limit rating
## 160 160